Abstract

We present a generalization of the wide baseline two-view matching problem, WxBS,
where X stands for a different subset of "wide baselines" in acquisition conditions such as
geometry, illumination, sensor and appearance. We introduce a novel dataset of ground-truthed
image pairs which include multiple "wide baselines" and show that state-of-the-art
matchers fail on almost all image pairs from the set. A novel matching algorithm
for addressing the WxBS problem is introduced, and we show experimentally that the
WxBS-M matcher dominates the state-of-the-art methods on both the new and existing
datasets.



1 Introduction

The Wide Baseline Stereo (WBS) matching problem, first formulated by Pritchett and
Zisserman [m], has received significant attention in the last 15 years [123, BD].
Progressively more challenging two- and multi-view problems have been successfully
handled [BB], and recent algorithms [mi, rni] have shown impressive performance, e.g.
matching views of planar objects with an orientation difference of up to 160 degrees.

Besides the orientation and viewpoint baseline, other factors influence the complexity of
establishing geometric correspondence between a pair of images. The standard physical
models of image formation and acquisition consider, besides geometry, the effects of
illumination, the properties of the transparent medium that light rays pass through in
the scene, the surface properties of objects and the properties of the imaging sensors.

Figure 1: Examples of WxBS problems.

In this paper, we consider the generalization of Wide (geometric) Baseline Stereo to
WxBS, a two-view image matching problem where two or more of the image formation and
acquisition properties change significantly, i.e. they have a wide baseline. The "significant
change" distinguishes the problem from image registration, where dense correspondence
is routinely established between multi-modal images and various complex transformations
have been considered, see Zitova and Flusser [SS]. Operationally, a "wide baseline" means
that local, gradient-descent type methods fail.

The following single wide baseline stereo, or correspondence, problems and their
combinations are considered: illumination (WlBS), a difference in position, direction, number,
intensity and wavelength of light sources; geometry (WgBS), a difference in camera and
object pose, scale and resolution, i.e. the "classical" WBS; sensor (WsBS), a change in sensor
type (visible, IR, MR), noise, image preprocessing algorithms inside the camera, etc.;
appearance (WaBS), a difference in object appearance because of time or seasonal changes,
occlusions, turbulent air, etc. We denote matching problems, or, equivalently, image pairs,
with a significant change in only one of the groups listed as W1BS; if a combination of
effects is present, as WxBS. To our knowledge, almost all published image datasets and
algorithms are in the W1BS class [IZ3], [IZ3], [S3], [Q], [II3], [O].

We present a new public dataset with ground truth which combines the above-mentioned
challenges. It contains W2BS image pairs, with viewpoint and appearance, viewpoint and
illumination, viewpoint and sensor, or illumination and appearance changes, as well as W3BS
problems, where viewpoint, appearance and lighting differ significantly.

We show that state-of-the-art matchers perform poorly on the introduced image pairs,
and propose a novel algorithm which significantly outperforms the state of the art
without a dramatic loss of speed.

The paper is organised as follows. In Section 2, relevant datasets and matching algorithms
are reviewed. The dataset for WxBS problems and the associated evaluation protocol are
presented in Section 3. The novel WxBS matching algorithm is introduced in Section 4.
Experimental results are described in Section 5. The paper is concluded in Section 6.


2 Related Work 

Viewpoint change. The stereo problem, i.e. matching of two images taken from different
viewpoints, has always received significant attention from the computer vision community, as
it is a critical component of the structure from motion task. For images taken concurrently,
in both the calibrated and the uncalibrated setup, the narrow-baseline problem is mature
[SB] and can now be solved in real time and on a large scale [□].

For wide-baseline matching, the standard evaluation protocol focuses on the feature
detection and description stages [IZ3]. However, the methodology and datasets of [IZ3] are
limited to images related by a homography. Attempts have been made to extend the evaluation
to 3D scenes [III, m], but they are significantly less popular. Neither of the above-mentioned
protocols evaluates the performance of the matching stage and thus of the full matching
pipeline.

As a reference, we adopted two recent algorithms which report good performance and
whose binaries are freely available. The ASIFT method [123] synthetically transforms
images in order to extend the range of affine transformations handled by the DoG detector.
This idea has been further extended in MODS [123], which incorporates multiple detectors and
adopts an iterative approach that attempts to minimize the matching time. Both algorithms
are able to match images with extreme viewpoint changes. Mishkin et al. [123] introduced
an extreme-viewpoint dataset, which is used to test the ability of the newly proposed WxBS
matcher to handle viewpoint changes.
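The view synthesis that ASIFT relies on can be illustrated by enumerating the simulated views. The sketch below assumes the sampling commonly cited for ASIFT (tilts t = √2^k for k = 0..5, in-plane rotations in steps of 72°/t over [0°, 180°)); MODS uses its own per-iteration setup, which is not reproduced here. The helper names `asift_views` and `simulated_affine` are hypothetical, and the actual image warping and anti-aliasing are omitted.

```python
import math

def asift_views(n_tilts=5, base_step_deg=72.0):
    """Enumerate (tilt, rotation in degrees) pairs for ASIFT-style view synthesis.

    Assumed sampling: tilts t = sqrt(2)**k for k = 0..n_tilts; for each t > 1,
    rotations phi = j * base_step_deg / t for all j with phi < 180 degrees.
    """
    views = [(1.0, 0.0)]  # t = 1: the original image, no rotation simulated
    for k in range(1, n_tilts + 1):
        t = 2.0 ** (k / 2.0)  # exact powers of two for even k
        step = base_step_deg / t
        j = 0
        # the tolerance excludes phi == 180 exactly (e.g. for t = 2 and t = 4)
        while j * step < 180.0 - 1e-9:
            views.append((t, j * step))
            j += 1
    return views

def simulated_affine(t, phi_deg):
    """2x2 linear map of one simulated view: rotate by phi, then compress x by t."""
    phi = math.radians(phi_deg)
    c, s = math.cos(phi), math.sin(phi)
    # rotation [[c, -s], [s, c]] followed by diag(1 / t, 1)
    return [[c / t, -s / t],
            [s, c]]
```

With these assumed parameters the enumeration yields 43 views per image, which illustrates why view synthesis dominates the running time of such matchers.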

Multimodal image analysis is needed for the alignment of images acquired by different
sensors. Most commonly, the problem is encountered in remote sensing and in medical
imaging. For instance, in [O], red-free and fluorescein angiographic images are matched.
Similarly, in different modes of magnetic resonance imaging, the modality of the captured data
depends on the magnetic properties of the scanned chemical compound. In remote sensing,
multimodal matching involves, e.g., registering visual spectrum images against near-infrared
(NIR) or long-wave infrared (LWIR) images.

Multimodal registration methods are usually divided into area-based and feature-based
methods. As we are interested in extending the challenges to multiple-baseline variations,
area-based methods are omitted, as they lack scale invariance [O].

The feature-based approaches [O] and [O] identify the selection of the response threshold,
i.e. the minimal image contrast which triggers the detector, as the main issue of existing
algorithms in the context of multimodal matching. In [O], the Difference of Gaussians
(DoG) [El] response is normalised by the local average image intensity in cases when the image
contrast is low. Ghassabi et al. [O] present a variant of the DoG detector which sets a
local response threshold for each image cell on the basis of the image entropy. In [□], it is
argued that the Harris detector is more suitable for this task, as the information along
boundaries is preserved across different image modalities.

The main issue of the widely used SIFT descriptor [El] in the context of multimodal
images is the lack of invariance to gradient reversal. Two approaches to address this issue
have been proposed in the literature. The first generates a second SIFT descriptor of the
feature, for the gradient-reversed image, by reordering the SIFT vector [O]. We refer to this
method as inverted-SIFT. The second method [□], denoted as half-SIFT, limits local image
gradient directions to (0, π) by merging opposite gradient directions in orientation estimation.
Unlike inverted-SIFT, this method allows matching of images that are only partially inverted
(per patch), i.e. where some gradient directions stay the same while others are reversed. The
downside is a reduction in descriptor discriminability.
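A minimal sketch of the half-SIFT idea, assuming a 128-D SIFT vector laid out as 4 × 4 cells × 8 orientation bins with the orientation index varying fastest, and assuming the patch frame stays fixed. Summing bins that are π apart yields 4 bins per cell over [0, π). The actual half-SIFT [□] merges directions already during orientation estimation and recomputes the patch, so this post-hoc fold (hypothetical helper `fold_to_half_sift`) only illustrates why the result is unaffected by gradient reversal.

```python
import numpy as np

N_CELLS, N_BINS = 16, 8  # assumed SIFT layout: 4 x 4 cells, 8 orientation bins each

def fold_to_half_sift(desc):
    """Merge opposite gradient orientations of a 128-D SIFT vector (sketch).

    Assumes desc[cell * 8 + b] holds orientation bin b of a cell and that bins
    b and b + 4 are pi radians apart. Returns 4 bins per cell covering [0, pi),
    i.e. a 64-D vector, L2-normalized.
    """
    h = np.asarray(desc, dtype=float).reshape(N_CELLS, N_BINS)
    half = (h[:, : N_BINS // 2] + h[:, N_BINS // 2 :]).reshape(-1)
    norm = np.linalg.norm(half)
    return half / norm if norm > 0 else half
```

A reversed patch moves the mass of bin b to bin b + 4 (mod 8); after folding both land in the same half-bin, which is the invariance that costs discriminability.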

Inverted-SIFT has a negligible computational cost, as it can be generated from SIFT
descriptors by rearranging the data in the gradient histogram. The only associated cost is
in the matching, since twice as many features are matched in the second image. For the
half-SIFT method, the feature patch and its descriptor have to be extracted anew, as the
dominant feature orientation differs from SIFT's dominant orientation.
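The rearrangement behind inverted-SIFT can be sketched as follows, under the assumption that the patch frame (centre, scale and orientation) is kept fixed. Inverting the intensities turns every gradient by π, so in each spatial cell the mass of orientation bin b moves to bin (b + 4) mod 8. The function name `inverted_sift` and the 4 × 4 × 8 layout are assumptions, not the implementation of [O].

```python
import numpy as np

def inverted_sift(desc, n_cells=16, n_bins=8):
    """SIFT vector the same patch would have with its intensities inverted (sketch).

    Assumes the patch frame (centre, scale, orientation) is kept fixed and the
    orientation index varies fastest. Inverting intensities turns each gradient
    by pi, so bin b of every cell moves to bin (b + n_bins / 2) mod n_bins.
    """
    h = np.asarray(desc).reshape(n_cells, n_bins)
    return np.roll(h, n_bins // 2, axis=1).reshape(-1)
```

Each descriptor of the first image is then compared with both the SIFT and the inverted-SIFT descriptors of the second image, which is where the doubled matching cost comes from.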

An example of a multimodal image registration dataset is presented in [□]. This dataset
consists of 100 pairs of vertically aligned images from a camera and an LWIR thermal sensor.
The viewpoint changes between related image pairs are negligible.

Change in object illumination and appearance. Techniques similar to those developed for
multimodal image matching can be used for matching images of differently illuminated
objects. In [O], the authors employ half-SIFT and further modify the SIFT descriptor so
that it collects only gradients located on edges. Yang et al. [S3] use Difference of
Gaussian features and SIFT to estimate the transformation between the images. If no matches
are found, an identity transformation is assumed. Starting from a single local match, multiscale
features together with local image statistics are used in an iterative procedure called
Dual-Bootstrap to enlarge the region of good alignment. The data presented in [O] are used in
Section 5.

Hauagge et al. [lEO] argue that local symmetries survive significant illumination changes
and develop a higher-level feature detector for matching urban scenes, where symmetries
are abundant. They also assume that the vertical direction is aligned with one of the image
edges. The method proposed in [HB] is able to match images of architectural objects
taken many years apart, and even sketches to photos. The dataset introduced in the paper
contains 46 pairs of images.

Matching of images depicting very different appearances of the same object arises in
computer vision applications. A system for guided drawing of free-form objects called
ShadowDraw is presented in [ED]. It can be seen as a large-scale image retrieval system which
interactively looks for images based on sketches given by a user. In the object classification
field, the multiple-appearance problem has been investigated in [EB], who train a data-driven
visual similarity measure in order to match images to sketches or paintings. These two
approaches use global image description rather than local image feature matching.

3 Datasets 

Datasets used in the experiments are listed in Table 1. When evaluating detectors (Section 5)
and the proposed matching algorithm (Section 4), all dataset images are used. However,
descriptor evaluation is performed only on a subset of the most challenging and prominent
pairs with a provided homography from each WxBS category (e.g. only pairs 1-6 from OxfordAffine).

Most of the published datasets (with the exception of the LostInPast dataset [O]) include
only a single nuisance factor per image pair. This is suitable for evaluating the robustness
to a particular nuisance factor, but fails to predict performance in more complex
environments. One of the motivations of the proposed WxBS dataset is to address this issue.

Table 1: Datasets used for evaluation

Short name | Proposed by                          | #images    | Type
-----------------------------------------------------------------------------------------------------
GDB        | Kelman et al. [O], 2007              | 22 pairs   | WlBS, WsBS
SymB       | Hauagge and Snavely [US], 2012       | 46 pairs   | WaBS, WlBS
MMS        | Aguilera et al. [□], 2012            | 100 pairs  | WsBS
EVD        | Mishkin et al. [EB], 2013            | 15 pairs   | WgBS
OxAff      | Mikolajczyk et al. [E3], [□], 2013   | 8 sixplets | WgBS
EF         | Zitnick and Ramnath et al. [E3], 2011 | 8 sixplets | WgBS, WlBS
Amos       | Jacobs et al. [O], 2007              | > 100K     | WlBS, WaBS
VPRiCE     | VPRiCE Challenge 2015 [EB]           | 3K pairs   | WgaBS, WglBS, WgsBS
Past       | Fernando et al. [D], 2014            | 502 images | WgaBS
WxBS       | here                                 | 37 pairs   | WaBS, WgaBS, WglBS, WgsBS, WlaBS, WgalBS


WxBS dataset and evaluation protocol. A set of 37 image pairs has been collected from
Flickr and other sources. The dataset is divided into 6 categories based on the combination
of nuisance factors present, see Table 2. For every image pair, a set of approximately 20
ground-truth correspondences has been annotated. Selected examples are presented in Figure 2.
The resolution of the majority of the images is 800 × 600, with the exception of the LWIR
images from the WgsBS dataset, which were captured by a thermal camera with a resolution of
250 × 250 pixels. The selected image pairs contain both urban and natural scenes.

Ground truth and the evaluation protocol. In image registration tasks, it is often
sufficient to define ground truth as a homography between an image pair. However, the
WxBS dataset contains significant viewpoint changes, and in the case of a non-planar scene
a homography can, at best, cover the dominant plane.

We assume that an ideal algorithm matches the majority of the scene content; thus our
ground truth is a set of manually selected correspondences which evenly cover the part of
the scene visible in both images. The average number of correspondences per image pair is
shown in Table 2.

Table 2: The WxBS dataset categories

Short name | Nuisance                        | #images | Avg. # GT corr. per pair
--------------------------------------------------------------------------------
map2ph     | appearance (map to photo)       | 6 pairs | homography provided
WgaBS      | viewpoint, appearance           | 5 pairs | 22
WglBS      | viewpoint, lighting             | 9 pairs | 21
WgsBS      | viewpoint, modality             | 5 pairs | 18
WlaBS      | lighting, appearance            | 4 pairs | 25
WgalBS     | viewpoint, appearance, lighting | 8 pairs | 17

Figure 2: Examples of image pairs from the WxBS dataset: a) WgaBS (5 pairs), b) WgsBS (5 pairs),
c) WlaBS (4 pairs), d) WglBS (9 pairs), e) WgalBS (8 pairs).

The evaluation protocol for the WxBS dataset. For each image pair indexed by i ∈ I,
we have manually annotated a set of correspondences (u_i, v_i) ∈ C_i, where u_i and v_i are
positions in the first and the second image, respectively. As the error function e, we use the
symmetric epipolar distance for epipolar geometry and the symmetric reprojection error for
homography [O].
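The two error functions can be written as below. This is a sketch under stated assumptions: both errors are taken as the sum of the (unsquared) distances in the two images, whereas some references define squared variants; points are given in pixels, and the helper names are hypothetical.

```python
import numpy as np

def _homogeneous(pts):
    """(N, 2) pixel coordinates -> (N, 3) homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float).reshape(-1, 2)
    return np.hstack([pts, np.ones((len(pts), 1))])

def symmetric_epipolar_distance(F, u, v):
    """Distance of v to the line F u plus distance of u to the line F^T v."""
    u_h, v_h = _homogeneous(u), _homogeneous(v)
    lines2 = u_h @ F.T  # row i is the epipolar line F u_i in image 2
    lines1 = v_h @ F    # row i is the epipolar line F^T v_i in image 1
    algebraic = np.abs(np.sum(v_h * lines2, axis=1))  # |v^T F u|
    return (algebraic / np.linalg.norm(lines2[:, :2], axis=1)
            + algebraic / np.linalg.norm(lines1[:, :2], axis=1))

def symmetric_reprojection_error(H, u, v):
    """||v - H u|| + ||u - H^-1 v|| for a homography H from image 1 to image 2."""
    u_h, v_h = _homogeneous(u), _homogeneous(v)
    fwd = u_h @ H.T
    fwd = fwd[:, :2] / fwd[:, 2:3]
    bwd = v_h @ np.linalg.inv(H).T
    bwd = bwd[:, :2] / bwd[:, 2:3]
    return (np.linalg.norm(v_h[:, :2] - fwd, axis=1)
            + np.linalg.norm(u_h[:, :2] - bwd, axis=1))
```

For example, for a pure horizontal camera translation every epipolar line is horizontal, so a correspondence that is off by one row gets an error of 1 pixel in each image.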

The recall on the ground-truth correspondences C_i of image pair i for the geometry model M_i is
computed as a function of a threshold θ:

$$r_{i,M_i}(\theta) = \frac{\left|\{(u_i, v_i) : (u_i, v_i) \in C_i,\; e(M_i, u_i, v_i) < \theta\}\right|}{|C_i|} \qquad (1)$$

using the appropriate error function e. For all pairs of each category W, we define the overall
recall per category as the fraction of all annotated correspondences in the category that are
confirmed for a given threshold θ.
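Eq. (1) and the per-category recall can be computed from the per-correspondence errors, as sketched below. Pooling all annotated correspondences of a category is our reading of the definition above; averaging per-pair recalls would be an alternative. Assigning an infinite error to every correspondence of a pair for which a matcher returns no model is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np

def pair_recall(errors, theta):
    """Eq. (1): fraction of the annotated correspondences of one pair with e < theta."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors < theta)) if errors.size else 0.0

def category_recall(errors_per_pair, theta):
    """Recall pooled over all annotated correspondences of all pairs in a category.

    errors_per_pair: one array of errors e(M_i, u, v) per image pair; a pair with
    no estimated model is assumed to carry np.inf for every correspondence.
    """
    confirmed = sum(int(np.sum(np.asarray(e, dtype=float) < theta))
                    for e in errors_per_pair)
    total = sum(len(e) for e in errors_per_pair)
    return confirmed / total if total else 0.0
```

Evaluating `category_recall` over a range of θ values gives the recall curve of one nuisance category.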
4 Matching algorithm for wide multiple baseline stereo


Algorithm 1: MODS-WxBS, a matcher for wide multiple baseline stereo

Input: two images I1, I2; minimum required number of matches θm; maximum number of iterations Smax.
Output: fundamental matrix F or homography H; a list of corresponding local features.

while (Nmatches < θm) and (Iter < Smax) do
    for I1 and I2 separately do
        1 Generate synthetic views according to the scale-tilt-rotation-detector setup for Iter.
        2 Detect local features using the adaptive threshold.
        3 Extract rotation-invariant descriptors: 3a rSIFT and 3b hrSIFT.
        4 Reproject local features to the original image.
    end for
    5 Generate tentative correspondences (TC) based on the first geometrically inconsistent
      rule, for rSIFT and hrSIFT separately, using a kD-tree.
    6 Filter duplicates.
    7 Geometric verification of all TC with modified DEGENSAC, estimating F or H.
    8 Check geometric consistency of the LAFs with the estimated F.
end while


In this section, we propose a variant of the MODS [E3, IZ3] matcher designed for WxBS
problems, called WxBS-MODS, or WxBS-M for short. Its overall structure is shown in
Algorithm 1. The view synthesis is identical to that of the original MODS framework [IZ3].
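The outer loop of Algorithm 1 amounts to trying progressively richer view-synthesis setups until enough verified matches are found or the iteration budget runs out. In the sketch below, `match_with_setup` is a hypothetical callable standing for steps 1-8 of one iteration; whether matches from earlier iterations are accumulated is not specified here, so the sketch simply keeps the latest result.

```python
def iterative_matching(match_with_setup, min_matches, max_iters):
    """Outer loop of Algorithm 1 (sketch).

    match_with_setup(it) is a hypothetical callable running steps 1-8 with the
    view-synthesis setup of iteration `it`; it returns (model, matches).
    """
    model, matches, it = None, [], 0
    while len(matches) < min_matches and it < max_iters:
        model, matches = match_with_setup(it)  # keeps only the latest result
        it += 1
    return model, matches, it
```

Easy pairs thus stop after the cheap first setup, and only hard pairs pay for the expensive synthesis of later iterations.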

Tentative correspondences are generated using a kD-tree [ED] and the first geometrically
inconsistent rule [IZa], with a radius of 10 pixels as the threshold. Descriptors from
different detector types (Hessian, MSER+, MSER-), as well as from different descriptors, are
put in separate kD-trees. After matching, all tentative correspondences are put into a single
list, and duplicates, which appear due to view synthesis, are filtered out if the features in
both images are within a 3 pixel radius.
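The two matching steps described above can be sketched in NumPy. `fginn_matches` implements the first geometrically inconsistent nearest neighbour rule with brute-force distances in place of a kD-tree, and `filter_duplicates` keeps a correspondence only if no previously kept one lies within 3 pixels of it in both images. Both names are hypothetical; the ratio threshold of 0.8, skipping features without any geometrically inconsistent neighbour, and the greedy first-come order are assumptions.

```python
import numpy as np

def fginn_matches(desc1, desc2, xy2, ratio=0.8, radius=10.0):
    """Tentative correspondences by the first geometrically inconsistent NN rule.

    The nearest neighbour of each descriptor of image 1 is compared not with the
    second nearest neighbour but with the nearest one whose keypoint lies more
    than `radius` pixels away from the first one in image 2. Brute-force
    distances replace the kD-tree; ratio=0.8 is an assumed threshold.
    """
    desc1, desc2 = np.asarray(desc1, float), np.asarray(desc2, float)
    xy2 = np.asarray(xy2, float)
    dist = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        nn, rest = order[0], order[1:]
        far = np.linalg.norm(xy2[rest] - xy2[nn], axis=1) > radius
        if not far.any():
            continue  # assumption: no inconsistent neighbour, no decision possible
        second = rest[np.argmax(far)]  # first geometrically inconsistent neighbour
        if row[nn] < ratio * row[second]:  # avoids dividing by a zero distance
            matches.append((i, int(nn)))
    return matches

def filter_duplicates(corrs, radius=3.0):
    """Indices of correspondences kept after greedy duplicate removal.

    corrs: (N, 4) rows (x1, y1, x2, y2). A correspondence is a duplicate when both
    of its points lie within `radius` pixels of the points of a kept one.
    """
    corrs = np.asarray(corrs, float)
    kept = []
    for k, c in enumerate(corrs):
        duplicate = any(
            np.linalg.norm(c[:2] - corrs[m, :2]) <= radius
            and np.linalg.norm(c[2:] - corrs[m, 2:]) <= radius
            for m in kept)
        if not duplicate:
            kept.append(k)
    return kept
```

In the first test case a plain second-nearest-neighbour ratio test would reject the match, because the two closest descriptors belong to neighbouring keypoints (0.1 / 0.12 > 0.8); the FGINN rule skips that near-duplicate and compares against a geometrically distant one.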

5 Evaluation of description and detection algorithms

In this section, multiple detection and description algorithms are evaluated. 

Descriptors evaluation. The evaluation protocol is as follows. The dataset consists of 40
image pairs from the datasets listed in Table 1, divided into 5 parts by the nuisance factor. For
all pairs, a homography is the appropriate two-view relationship: the images either show a scene
without significant relative depth or were taken from virtually identical viewpoints. In order
to minimize bias towards a specific detector, affine-covariant regions detected by Hessian-Affine,
MSER and FOCI in the first, least challenging image of the pair are used (the visible image for
IR-visible pairs, the day image for day-night pairs, the frontal view when the viewpoint changes,
etc.). The affine-covariant regions have been detected with dominant orientation and then
reprojected to the second image by the ground-truth homography. Features which are not visible
in the second image have been discarded. Therefore the geometric repeatability of the selected
affine regions is always 100% and the maximum possible recall is 1. Color-to-grayscale image
conversion has been done via channel averaging, which gives the best matching performance [HE].
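The transfer of affine regions to the second image can be sketched as follows: the centre is mapped by the ground-truth homography, and the region shape by the local affine approximation (Jacobian) of the homography at that point. Representing a region as (x, y, A) with a 2 × 2 shape matrix A, and checking visibility of the centre only, are simplifying assumptions; the helper names are hypothetical.

```python
import numpy as np

def project_point(H, x, y):
    """Map (x, y) from image 1 to image 2 with the homography H."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

def homography_jacobian(H, x, y):
    """Local affine approximation (2x2 Jacobian) of H at (x, y)."""
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    xp, yp = project_point(H, x, y)
    return np.array([[H[0, 0] - xp * H[2, 0], H[0, 1] - xp * H[2, 1]],
                     [H[1, 0] - yp * H[2, 0], H[1, 1] - yp * H[2, 1]]]) / w

def transfer_regions(H, regions, width2, height2):
    """Reproject affine regions (x, y, A) into image 2, dropping invisible ones.

    A is a 2x2 shape matrix; only the region centre is tested for visibility
    (a simplifying assumption).
    """
    out = []
    for x, y, A in regions:
        xp, yp = project_point(H, x, y)
        if 0 <= xp < width2 and 0 <= yp < height2:
            out.append((xp, yp, homography_jacobian(H, x, y) @ A))
    return out
```

Since region i in the first image and its transfer are known to correspond, every surviving region has exactly one correct match, which is why the maximum recall is 1.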

Then the affine regions were normalized to 41 × 41 patches (scale σ = 3√3) and described
with the given descriptors. The affine normalization is performed even for the fast binary
descriptors, for which it is rarely used because of the significant additional processing time.
However, the goal of our experiment is to explore descriptor performance in challenging
conditions, not speed. The procedure helps: the typical Hamming distance threshold for binary
descriptors on unnormalized patches is around 60-80, while on affine-normalized patches similar
performance is obtained with a threshold of around 10-30. All descriptors clearly benefit from
the affine normalization, e.g. the graffiti 1-6 pair from the OxfordAffine dataset could be
matched with the FREAK descriptor only when using a normalized patch.

Figure 3: First row: descriptors computed using the authors' implementations. Second row:
descriptors computed on photometrically normalized patches (mean = 0.5, var = 0.2), as done
in SIFT. Third row: top 5 complementary pairs of descriptors (photometrically normalized).
The numbers in the legend are mean average precision. Bottom row: examples of the image pairs
from each subset. Note that the axis scales differ in each column, i.e. for different WxBS
problems.

The tested descriptors are: SIFT [ED], rSIFT [B], hrSIFT (gradients in the interval [0, π)) [O],
InvSIFT (SIFT with reordered cells, as for an inverted image) [O], LIOP [Sl], AKAZE [S],
MROGH [mi], FREAK [i], ORB [E3], SymFeat [Ml], SSIM [E3] (implementation [S]),
DAISY [E3] and L2-normalized raw grayscale pixel intensities. Floating-point descriptors
have been compared using the L2 distance, binary ones using the Hamming distance. The
recall-precision curves are shown in Figure 3. The second-nearest distance ratio is used to
parameterize the curve for floating-point descriptors, the Hamming distance for binary ones.
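The recall-precision curve parameterized by the second-nearest distance ratio can be sketched as below. Because every region of the first image was reprojected into the second one, region i of the first image is assumed to correspond to region i of the second, so a nearest neighbour is correct exactly when its index matches. Sweeping the ratio threshold then amounts to sorting the queries by ratio; the function name `snn_ratio_pr` is hypothetical.

```python
import numpy as np

def snn_ratio_pr(desc1, desc2):
    """Precision and recall as the second-nearest distance ratio threshold grows.

    Assumes region i of image 1 corresponds to region i of image 2 and that
    image 2 has at least two descriptors. Returns the sorted ratios together
    with the precision and recall obtained when accepting every query whose
    ratio is at most the corresponding value.
    """
    desc1, desc2 = np.asarray(desc1, float), np.asarray(desc2, float)
    dist = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    rows = np.arange(len(desc1))
    nn, sn = order[:, 0], order[:, 1]
    ratio = dist[rows, nn] / np.maximum(dist[rows, sn], 1e-12)
    correct = nn == rows               # ground truth: i <-> i
    idx = np.argsort(ratio, kind="stable")
    true_pos = np.cumsum(correct[idx])
    precision = true_pos / np.arange(1, len(idx) + 1)
    recall = true_pos / len(desc1)     # every region has a true match
    return ratio[idx], precision, recall
```

For binary descriptors the same sweep would be done over the Hamming distance of the nearest neighbour instead of the ratio.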

Note that most of the descriptors gain significantly from photometric normalization, cf.
the first two rows of Figure 3. The published implementations are clearly sensitive to
contrast variations.
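Photometric normalization of a patch, as described in the caption of Figure 3, can be sketched as an affine intensity mapping to a fixed mean and variance. Reading "var = 0.2" as the variance (not the standard deviation) is an assumption, as is the small `eps` guarding constant patches; `photometric_normalize` is a hypothetical name.

```python
import numpy as np

def photometric_normalize(patch, mean=0.5, var=0.2, eps=1e-8):
    """Affinely rescale patch intensities to the given mean and variance."""
    patch = np.asarray(patch, dtype=float)
    z = (patch - patch.mean()) / (patch.std() + eps)  # zero mean, unit variance
    return mean + np.sqrt(var) * z
```

Any positive gain and offset of the input cancel out, so a descriptor computed afterwards sees the same patch regardless of global contrast.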

The results show that gradient-histogram based SIFT and its variants, including DAISY,
are the best performing descriptors by a large margin in the presence of any (geometric,
illumination, etc.) nuisance factor, despite the fact that some of the competitors, e.g. LIOP
and MROGH, have been specifically designed to deal with illumination changes. The second best
descriptor is, surprisingly, the patch of contrast-L2-normalized pixels, which beats all
remaining descriptors. It has a huge memory footprint (1681 floats), but affine- and
photometrically normalized, L2-normed grayscale pixel intensities are a strong descriptor baseline.
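One plausible reading of the raw-pixel baseline is sketched below: the 41 × 41 patch is flattened into 1681 values, the mean is subtracted and the vector is L2-normalized, which makes it invariant to affine intensity changes with positive gain. The exact normalization used in the experiments is not specified here, so this, like the name `raw_pixel_descriptor`, is an assumption.

```python
import numpy as np

def raw_pixel_descriptor(patch):
    """Contrast-normalized raw pixels of a 41x41 patch: 1681 floats, unit L2 norm."""
    v = np.asarray(patch, dtype=float).reshape(-1)
    v = v - v.mean()  # remove the brightness offset
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v  # remove the gain
```

A gradient reversal (negative gain) flips the sign of this descriptor, so, unlike hrSIFT, the baseline is not reversal-invariant.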

Most descriptors, despite their different underlying assumptions and algorithmic structure,
successfully match almost the same patches (see the third row of Figure 3), and the descriptor
most complementary to the leading rSIFT is its gradient-reversal-insensitive version,
hrSIFT.
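Assuming, as the conclusions indicate, that rSIFT denotes RootSIFT, the mapping from a SIFT vector is short: L1-normalize, then take the element-wise square root. The sketch below (hypothetical helper `root_sift`) also checks the property that motivates it: the squared Euclidean distance between RootSIFT vectors equals 2 minus twice the Bhattacharyya coefficient of the L1-normalized histograms, i.e. a Hellinger-type comparison. hrSIFT would apply the same mapping to a half-SIFT vector.

```python
import numpy as np

def root_sift(desc, eps=1e-12):
    """RootSIFT: L1-normalize a non-negative SIFT vector, then take the sqrt."""
    d = np.asarray(desc, dtype=float)
    return np.sqrt(d / (d.sum() + eps))
```

Because the result has unit L2 norm, existing L2-based kD-tree matching can be reused unchanged.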

The results confirming the dominance of SIFT-based methods are in agreement with [E3]
and [O], despite the fact that they adopted a rather different evaluation methodology.
However, we could not confirm a clear superiority of the SSIM descriptor over SymFeat, which
could be explained by the fact that the SSIM descriptor was designed for use only with the
SSIM detector.

Detectors evaluation. The following detectors are compared: MSER [I23], DoG [O],
Hessian-Affine [El] (implementation [El]), FOCI [E3], iiDoG [O], WADE [El], WaSH [O],
SURF [□], SFOP [O] and AKAZE [0]. We focus on getting a reliable answer to the
"match/non-match" question for real image pairs. Therefore the performance criterion is the
number of successfully matched pairs using the best combination of descriptors (see
Descriptors evaluation above), i.e. rSIFT and hrSIFT. Matching is done as in Algorithm 1,
except that no view synthesis is performed. Image pairs are considered matched if more than
15 correct inliers to a homography are found. Since the Lost-in-past dataset contains 2300
matchable image pairs, which is infeasible for direct matching, we have selected a subset of
172 medium-challenging image pairs. Other datasets are used fully.

Adaptive threshold of the detector response. One of the main problems in matching day
to night and infrared images is the low number of detected features. The problem is acute
in dark, low-contrast images in the WgsBS and MMS [□] datasets. A possible approach
addressing the problem is iiDoG [O], where the difference of Gaussians is normalized by
the sum of Gaussians. It works well, but cannot be easily applied to other types of detectors,
e.g. MSER.
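The iiDoG normalization can be sketched in pure NumPy: each DoG response is divided by the sum of the two Gaussian-smoothed images, so a global change of brightness gain cancels out. The scales σ1 = 1.6 and σ2 = 1.6 · 2^(1/3), the reflective border handling and the helper names are assumptions for illustration, not the settings of [O].

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian smoothing with reflective borders, NumPy only."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

SIGMA1, SIGMA2 = 1.6, 1.6 * 2 ** (1 / 3)  # assumed scales of one DoG level

def dog(img):
    """Plain difference of Gaussians."""
    return gaussian_blur(img, SIGMA1) - gaussian_blur(img, SIGMA2)

def iidog(img, eps=1e-12):
    """Difference of Gaussians divided by the sum of the same two Gaussians."""
    g1, g2 = gaussian_blur(img, SIGMA1), gaussian_blur(img, SIGMA2)
    return (g1 - g2) / (g1 + g2 + eps)
```

Plain DoG responses shrink with the image gain, so a fixed threshold discards most extrema in a dark image, whereas the iiDoG responses stay the same.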

Instead, we propose to use the following adaptive thresholding for all feature detectors.
First, all local extrema of the response function are detected (i.e. no thresholding takes
place). Next, the detected features are sorted according to the response magnitude. If the
number of detected features with response magnitude above the detector threshold θ is greater
than a given number R_min, these are output and the algorithm terminates (this is the standard
approach). If there are not enough features above the threshold, the top R_min features are output.
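The adaptive thresholding rule above can be sketched as follows, taking the responses of all local extrema as input. The helper name `adaptive_threshold` and the parameter names `theta` and `r_min` are illustrative.

```python
import numpy as np

def adaptive_threshold(responses, theta, r_min):
    """Indices of the detections kept by the adaptive response threshold.

    responses: response values of all local extrema, found without thresholding.
    """
    mags = np.abs(np.asarray(responses, dtype=float))
    order = np.argsort(-mags, kind="stable")  # strongest first
    above = order[mags[order] > theta]
    if len(above) > r_min:
        return above                          # standard thresholding
    return order[:r_min]                      # too few: keep the r_min strongest
```

Because the rule only needs a scalar response per extremum, it applies equally to DoG, Hessian-Affine and MSER (e.g. with a stability score as the response), which is its advantage over iiDoG.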

Discussion and results. The performance of the proposed WxBS-M matcher is compared
with the state-of-the-art matchers ASIFT [E3], Dual Bootstrap (DBstrap) [D] and MODS [E3]
on various WxBS problems.

The results are summarized in Table 3. Note that the state-of-the-art matchers were unable
to match almost any image pair that combines several nuisance factors. The proposed
WxBS-M matcher shows much better performance, but is still unable to solve even half of
the new dataset pairs.

Results in Table 3 confirm that the proposed adaptive thresholding strategy works as
well as, or even better than, iiDoG for DoG, while being 1.5 times faster. It also significantly
improves the results of MSER and Hessian-Affine, even when the main nuisance is in the
viewing geometry (EVD dataset).
Table 3: Detector evaluation results. Each cell gives the number of matched image pairs (left)
and the average running time (right). The row "# pairs" gives the number of image pairs in each
dataset. The FOCI detector is run through Wine, a compatibility layer for MS Windows programs,
so its time includes a big overhead.

Alg.     | EF       | EVD     | MMS      | WgaBS   | WgalBS  | WglBS   | WgsBS   | WlaBS   | Past     | OxAff    | SymB     | GDB
-----------------------------------------------------------------------------------------------------------------------------------
# pairs  | 33       | 15      | 100      | 5       | 8       | 9       | 5       | 4       | 172      | 40       | 46       | 22

Threshold adaptation
MSER     | 16 / 1.4 |   / 1.4 | 1 / 0.3  | 0 / 2.0 | 0 / 1.3 | 0 / 1.3 | 0 / 0.8 | 1 / 1.2 | 8 / 1.3  | 40 / 3.5 | 23 / 2.4 | 9 / 2.4
AdMSER   | 25 / 3.4 |   / 4.0 | 6 / 1.0  | 0 / 4.0 | 0 / 3.2 | 0 / 3.3 | 0 / 1.4 | 1 / 2.6 | 11 / 2.9 | 40 / 5.7 | 26 / 4.6 | 13 / 6.9
DoG      | 29 / 2.3 | 0 / 2.8 | 10 / 0.8 | 0 / 2.7 | 0 / 2.3 | 0 / 2.1 | 0 / 1.0 | 1 / 2.4 | 13 / 2.0 | 38 / 4.8 | 29 / 2.7 | 12 / 4.7
iiDoG    | 29 / 3.1 | 0 / 3.0 | 11 / 1.2 | 0 / 3.2 | 0 / 2.9 | 0 / 2.8 | 0 / 1.2 | 1 / 2.5 | 13 / 2.2 | 38 / 8.0 | 29 / 2.9 | 12 / 6.1
AdDoG    | 29 / 2.6 | 0 / 3.4 | 11 / 1.2 | 0 / 3.3 | 0 / 3.0 | 0 / 3.0 | 0 / 1.5 | 1 / 2.7 | 13 / 2.7 | 38 / 4.1 | 30 / 3.0 | 12 / 4.8
HesAf    | 32 / 4.6 | 1 / 5.2 | 15 / 1.2 | 0 / 5.5 | 0 / 3.8 | 0 / 4.2 | 0 / 2.0 | 1 / 3.6 | 24 / 4.0 | 40 / 11. | 35 / 5.8 | 17 / 9.1
AdHesAf  | 33 / 5.7 | 2 / 7.6 | 35 / 2.9 | 0 / 7.2 | 1 / 6.5 | 0 / 6.0 | 0 / 3.2 | 1 / 4.9 | 25 / 5.4 | 40 / 10. | 35 / 7.2 | 18 / 13.

Other detectors
WaSH     | 0 / 1.8  | 0 / 5.4 | 0 / 0.6  | 0 / 2.8 | 0 / 2.5 | 0 / 1.4 | 0 / 1.8 | 0 / 1.2 | 0 / 1.9  | 24 / 4.1 | 3 / 2.8  | 3 / 6.9
ORB      | 3 / 4.1  | 0 / 3.6 | 1 / 0.8  | 0 / 2.8 | 0 / 2.7 | 0 / 3.6 | 0 / 1.6 | 0 / 2.8 | 1 / 2.3  | 28 / 8.7 | 5 / 3.0  | 3 / 6.1
SURF     | 27 / 2.3 | 0 / 2.4 | 7 / 1.0  | 0 / 2.5 | 0 / 1.9 | 0 / 2.1 | 0 / 0.9 | 1 / 1.4 | 10 / 1.9 | 38 / 5.8 | 31 / 2.9 | 15 / 4.0
AKAZE    | 28 / 4.3 | 0 / 3.6 | 10 / 0.8 | 1 / 4.7 | 0 / 3.4 | 0 / 4.0 | 0 / 1.3 | 1 / 2.7 | 25 / 3.6 | 38 / 13. | 35 / 5.6 | 17 / 6.4
FOCI     | 29 / 12. | 0 / 39. | 14 / 11. | 1 / 32. | 0 / 29. | 0 / 29. | 0 / 20. | 1 / 29. | 21 / 13. | 38 / 35. | 35 / 27. | 17 / 45.
SFOP     | 25 / 11. | 0 / 16. | 12 / 4.7 | 0 / 12. | 0 / 10. | 0 / 10. | 0 / 9.2 | 0 / 7.5 | 11 / 12. | 36 / 15. | 24 / 11. | 8 / 17.
WADE     | 16 / 14. | 0 / 20. | 0 / 3.4  | 0 / 58. | 0 / 11. | 0 / 14. | 0 / 7.9 | 1 / 8.3 | 20 / 23. | 34 / 60. | 34 / 46. | 13 / 77.

State-of-the-art matchers
ASIFT    | 23 / 27. |   / 12. | 18 / 3.2 | 0 / 52. | 0 / 32. | 0 / 35. | 0 / 12. | 1 / 30. |   / 32.  | 40 / 102 | 27 / 14. | 15 / 41.
MODS     | 33 / 4.8 |   / 11. | 27 / 11. | 2 / 41. | 2 /     |         |         |         |          |          |          |
6 Conclusions

We have presented a new problem, the wide multiple baseline stereo (WxBS), which
considers matching of images that simultaneously differ in more than one image acquisition
factor, such as viewpoint, illumination or sensor type, or where object appearance changes
significantly, e.g. over time. A new dataset with ground truth for the evaluation of matching
algorithms has been introduced and will be made public.

We have extensively tested a large set of popular and recent detectors and descriptors,
and shown that the combination of RootSIFT and HalfRootSIFT descriptors with MSER
and Hessian-Affine detectors works best for many different nuisance factors. We have shown
that simple adaptive thresholding improves the Hessian-Affine, DoG and MSER (and possibly
other) detectors and allows them to be used on infrared and low-contrast images.

A novel matching algorithm for addressing the WxBS problem has been introduced.
We have shown experimentally that the WxBS-M matcher dominates the state-of-the-art
methods on both the new and existing datasets.